Abstract

We consider the problem of closeness testing for two discrete distributions in the practically relevant
setting of unequal sized samples drawn from each of them. Specifically, given a target error parameter
$\varepsilon > 0$, $m_1$ independent draws from an unknown distribution $p$, and $m_2$ draws from an unknown
distribution $q$, we describe a test for distinguishing the case that $p = q$ from the case that
$\|p - q\|_1 \ge \varepsilon$. If $p$ and $q$ are supported on at most $n$ elements, then our test is successful
with high probability provided $m_1 \ge n^{2/3}/\varepsilon^{4/3}$ and
$m_2 = \Omega(\max\{\frac{n}{\sqrt{m_1}\varepsilon^2}, \frac{\sqrt{n}}{\varepsilon^2}\})$; we show that this
tradeoff is optimal throughout this range, to constant factors. These results extend the recent work of
Chan et al. [9], who established the sample complexity when the two samples have equal sizes, and tighten
the results of Acharya et al. [3] by polynomial factors in both $n$ and $\varepsilon$. As a consequence, we
obtain an algorithm for estimating the mixing time of a Markov chain on $n$ states up to a $\log n$ factor
that uses $\tilde{O}(n^{3/2}\tau_{mix})$ queries to a “next node” oracle, improving upon the
$\tilde{O}(n^{5/3}\tau_{mix})$ query algorithm of [8]. Finally, we note that the core of our testing algorithm
is a relatively simple statistic that seems to perform well in practice, both on synthetic data and on
natural language data.


1 Introduction 

One of the most fundamental problems in statistical hypothesis testing is the question of distinguishing
whether two unknown distributions are very similar, or significantly different. Classical tests, like the
Chi-squared test or the Kolmogorov-Smirnov statistic, are optimal in the asymptotic regime, for fixed
distributions as the sample sizes tend towards infinity. Nevertheless, in many modern settings, such as the
analysis of customer data, web logs, natural language processing, and genomics, the support sizes and
complexity of the underlying distributions are far larger than the datasets, despite the quantity of
available data. This is evidenced by the fact that many phenomena are observed only a single time in the
datasets, and the empirical distributions of the samples are poor representations of the true underlying
distributions.¹ In such

This work is supported in part by NSF CAREER Award CCF-1351108. 

1 To give some specific examples, two recent independent studies [19, 25] each considered the genetic sequences of over 14,000
individuals, and found that rare variants are extremely abundant, with over 80% of mutations observed just once in the sample. A 
separate recent paper (El found that the discrepancy in rare mutation abundance cited in different demographic modeling studies 






settings, we must understand these statistical tasks not only in the asymptotic regime (in which the amount
of available data goes to infinity), but in the “undersampled” regime in which the dataset is significantly
smaller than the size or complexity of the distribution in question. Surprisingly, despite an intense history
of study by the statistics, information theory, and computer science communities, aspects of basic hypothesis
testing and estimation questions, especially in the undersampled regime, remain unresolved, and require
both new algorithms and new analysis techniques.

In this work, we examine the basic hypothesis testing question of deciding whether two unknown
distributions over discrete supports are identical (or extremely similar), versus have total variation
distance at least $\varepsilon$, for some specified parameter $\varepsilon > 0$. We consider (and largely
resolve) this question in the extremely practically relevant setting of unequal sample sizes. Informally,
taking $\varepsilon$ to be a small constant, we show that provided $p$ and $q$ are supported on at most $n$
elements, for any $\gamma \in [0, 1/3]$, the hypothesis test can be successfully performed (with high
probability over the random samples) given samples of size $m_1 = \Theta(n^{2/3+\gamma})$ from $p$, and
$m_2 = \Theta(n^{2/3-\gamma/2})$ from $q$. Furthermore, for every $\gamma$ in this range, this tradeoff between
$m_1$ and $m_2$ is necessary, up to constant factors. Thus our results smoothly interpolate between the known
bounds of $\Theta(n^{2/3})$ on the sample size necessary in the setting where one is given two equal-sized
samples [9, 30], and the bound of $\Theta(\sqrt{n})$ on the sample size in the setting in which the sample
is drawn from one distribution and the other distribution is known to the algorithm [22, 29]. Throughout
most of the regime of parameters, when $m_1 \ll m_2^2$, our algorithm is a natural extension of the algorithm
proposed in [9], and is similar to the algorithm proposed in [3] except with the addition of a normalizing
term. In the extreme regime when $m_1 \approx n$, our algorithm requires an additional statistic which appears
to be new. Throughout the regime of parameters, our algorithm is relatively simple, and appears to be
practically viable. In Section 4 we illustrate the efficacy of our approach on both synthetic data and the
real-world problem of deducing whether two words are synonyms, based on a small sample of the bi-grams in
which they occur.

We also note that, as pointed out in several related works fiUEJISl, this hypothesis testing question has 
several applications to other problems, such as estimating or testing the mixing time of Markov processes, 
and our results yield improved algorithms in these settings. 

1.1 Related Work 

The general question of how to estimate or test properties of distributions using fewer samples than would
be necessary to actually learn the distribution has been studied extensively since the late '90s. Most of the
work has focussed on “symmetric” properties (properties whose value is invariant to relabeling domain
elements) such as entropy, support size, and distance metrics between distributions (such as $\ell_1$ distance). This has
included both algorithmic work (e.g. 0 [6j j7. [HE JT3J 20, 211 [26] [271 29] 128]), and results on developing 
techniques and tools for establishing lower bounds (e.g. [23, 30, 26]). See the recent survey by Rubinfeld
for a more thorough summary of the developments in this area [24].

The specific problem of “closeness testing” or “identity testing”, that is, deciding whether two
distributions, $p$ and $q$, are similar, versus have significant distance, has two main variants: the
one-unknown-distribution setting in which $q$ is known and a sample is drawn from $p$, and the
two-unknown-distributions setting in which both $p$ and $q$ are unknown and samples are drawn from both. We
briefly summarize the previous results for these two settings.

can largely be explained by discrepancies in the sample sizes of the respective studies, as opposed to differences in the actual 
distributions of rare mutations across demographics, highlighting the importance of improved statistical tests in this “undersampled” 
regime. 
In the one-unknown-distribution setting (which can be thought of as the limiting setting in the case
that we have an arbitrarily large sample drawn from distribution $q$, and a relatively modest sized sample
from $p$), initial work of Goldreich and Ron [12] considered the problem of testing whether $p$ is the uniform
distribution over $[n]$, versus has distance at least $\varepsilon$. The tight bounds of
$\Theta(\sqrt{n}/\varepsilon^2)$ were later shown by Paninski [22], essentially leveraging the birthday paradox
and the intuition that, among distributions supported on $n$ elements, the uniform distribution minimizes the
number of domain elements that will be observed more than once. Batu et al. [7] showed that, up to
polylogarithmic factors of $n$, and polynomial factors of $\varepsilon$, this dependence was optimal for
worst-case distributions over $[n]$. Recently, an “instance-optimal” algorithm and matching lower bound was
shown: for any distribution $q$, up to constant factors,
$\max\{\frac{1}{\varepsilon}, \varepsilon^{-2}\|q^{-\max}_{-\Theta(\varepsilon)}\|_{2/3}\}$ samples from $p$ are
both necessary and sufficient to test $p = q$ versus $\|p - q\|_1 \ge \varepsilon$,
where $\|q^{-\max}_{-\Theta(\varepsilon)}\|_{2/3} \le \|q\|_{2/3}$ is the 2/3-rd norm of the vector of
probabilities of distribution $q$ after the maximum element has been removed, and the smallest elements up to
$\Theta(\varepsilon)$ total mass have been removed. (This immediately implies the tight bounds that if $q$ is
any distribution supported on $[n]$, $O(\sqrt{n}/\varepsilon^2)$ samples are sufficient to test its identity.)

The two-unknown-distribution setting was introduced to this community by Batu et al. 0 (refer 
to |0 for the journal version), and using collision statistics, they proposed an algorithm that requires 
$m = O(\varepsilon^{-8/3} n^{2/3} \log n)$ samples from each distribution. Later, Valiant [30] proved a lower
bound of $m = \Omega(n^{2/3})$, which was tight up to logarithmic factors in $n$. Recently, Chan et al. [9]
determined the optimal sample complexity for this problem: they showed that
$m = \Theta(\max\{n^{2/3}/\varepsilon^{4/3}, \sqrt{n}/\varepsilon^2\})$ samples are necessary and sufficient
for closeness testing, up to constant factors. In a slightly different vein,
Acharya et al. [31E1 recently considered the question of closeness testing with two unknown distributions 
from the standpoint of competitive analysis. They proposed an algorithm that performs the desired task
using $O(n^{3/2}\,\mathrm{polylog}\, n)$ samples, and established a lower bound of $\Omega(n^{7/6})$, where $n$
represents the number of samples required to determine whether a set of samples was drawn from $p$ versus
$q$, in the setting where $p$ and $q$ are explicitly known.

A natural generalization of this hypothesis testing problem, which interpolates between the
two-unknown-distribution setting and the one-unknown-distribution setting, is to consider unequal sized
samples from the two distributions. More formally, given $m_1$ samples from the distribution $p$, the
asymmetric closeness testing problem is to determine how many samples, $m_2$, are required from the
distribution $q$ such that the hypothesis $p = q$ versus $\|p - q\|_1 > \varepsilon$ can be distinguished with
large constant probability (say 2/3). Note that the results of Chan et al. [9] imply that it is sufficient to
consider $m_1 \ge \Theta(\max\{n^{2/3}/\varepsilon^{4/3}, \sqrt{n}/\varepsilon^2\})$. This problem was studied
recently by Acharya et al. [3]: they gave an algorithm that given $m_1$ samples from
the distribution p uses m 2 = 0(max{ n 3 lo ^i n ; ^/ nl ° gn }) samples from q, to distinguish the two distributions 

with high probability. They also proved a lower bound of m 2 = C(max{ })• There is a polynomial 

gap in these upper and lower bounds in the dependence on $n$, $\sqrt{m_1}$ and $\varepsilon$.

As a corollary to our main hypothesis testing result, we obtain an improved algorithm for testing the
mixing time of a Markov chain. The idea of testing mixing properties of a Markov chain goes back to the
work of Goldreich and Ron [12], who conjectured an algorithm for testing expansion of bounded-degree
graphs. Their test is based on picking a random node and testing whether random walks from this node
reach a distribution that is close to the uniform distribution on the nodes of the graph. They conjectured
that their algorithm had 0(y/n) query complexity. Later, Czumaj and Sohler ifTP . Kale and Seshadhri 
ED, and Nachmias and Shapira lfl~8ll have independently concluded that the algorithm of Goldreich and 
Ron is provably a test for the expansion property of graphs. Rapid mixing of a chain can also be tested using
eigenvalue computations. Mixing is related to the separation between the two largest eigenvalues [14, 17],
and eigenvalues of a dense $n \times n$ matrix can be approximated in $O(n^3)$ time and $O(n^2)$ space.
However, for a sparse $n \times n$ symmetric matrix with $m$ nonzero entries, the same task can be achieved in
$O(n(m + \log n))$ operations and $O(n + m)$ space. Later, Batu et al. [8] used their $\ell_1$ distance test on
the $t$-step distributions to test mixing properties of Markov chains. Given a finite Markov chain with state
space $[n]$ and transition matrix $P = ((P(x, y)))$, they essentially show that one can estimate the mixing
time $\tau_{mix}$ up to a factor of $\log n$ using $\tilde{O}(n^{5/3}\tau_{mix})$ queries to a next node oracle,
which takes a state $x \in [n]$ and outputs a state $y \in [n]$ drawn with probability $P(x, y)$. Such an
oracle can often be simulated significantly more easily than actually computing the transition matrix
$P(x, y)$.

We conclude this related work section with a comment on “robust” hypothesis testing and distance
estimation. A natural hope would be to simply estimate $\|p - q\|_1$ to within some additive $\varepsilon$,
which is a strictly more difficult task than distinguishing $p = q$ from $\|p - q\|_1 \ge \varepsilon$. The
results of Valiant and Valiant [26, 27, 29, 28] show that this problem is significantly more difficult than
hypothesis testing: the distance can be estimated to additive error $\varepsilon$ for distributions supported
on $\le n$ elements using samples of size $O(n/\log n)$ (both in the setting where one distribution is unknown
and in the setting where both are unknown). Moreover, $\Omega(n/\log n)$ samples are information theoretically
necessary, even if $q$ is the uniform distribution over $[n]$, and one wants to distinguish the case that
$\|p - q\|_1 \le \frac{1}{10}$ from the case that $\|p - q\|_1 \ge \frac{9}{10}$. Recall that the non-robust test
of distinguishing $p = q$ versus $\|p - q\|_1 > 9/10$ requires a sample of size only $O(\sqrt{n})$. The exact
worst-case sample complexity of distinguishing whether $\|p - q\|_1 \le \frac{1}{n^c}$ versus
$\|p - q\|_1 \ge \varepsilon$ is not well understood, though in the case of constant $\varepsilon$, up to
logarithmic factors, the required sample size seems to scale linearly in the exponent between $n^{2/3}$ and
$n$ as $c$ goes from 1/3 to 0.

1.2 Our results 

Our main result resolves the closeness testing problem in the unequal sample setting, to constant factors, in
terms of the worst-case distributions of support size $\le n$:

Theorem 1. Given $m_1 \ge n^{2/3}/\varepsilon^{4/3}$ and $\varepsilon > n^{-1/12}$, and sample access to
distributions $p$ and $q$ over $[n]$, there is an $O(m_1)$ time algorithm which takes $\Theta(m_1)$ samples
from $p$ and $m_2 = O(\max\{\frac{n}{\sqrt{m_1}\varepsilon^2}, \frac{\sqrt{n}}{\varepsilon^2}\})$ samples from
$q$, and with probability at least 2/3 distinguishes whether

$$\|p - q\|_1 \le O\left(\frac{1}{m_2}\right) \quad \text{versus} \quad \|p - q\|_1 \ge \varepsilon. \qquad (1)$$

Moreover, given $\Theta(m_1)$ samples from $p$,
$\Omega(\max\{\frac{n}{\sqrt{m_1}\varepsilon^2}, \frac{\sqrt{n}}{\varepsilon^2}\})$ samples from $q$ are
information-theoretically necessary to distinguish $p = q$ from $\|p - q\|_1 \ge \varepsilon$ with any constant
probability bounded above by 1/2.
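
As a quick sanity check on this tradeoff, the worked calculation below evaluates the bound on $m_2$ at the two ends of the allowed range of $m_1$, and for constant $\varepsilon$. It is a sketch that assumes the bound has the form $m_2 = \Theta(\max\{n/(\sqrt{m_1}\varepsilon^2), \sqrt{n}/\varepsilon^2\})$ and ignores constant factors.

```latex
\begin{align*}
m_1 = \frac{n^{2/3}}{\varepsilon^{4/3}}
  &\;\Longrightarrow\;
  \frac{n}{\sqrt{m_1}\,\varepsilon^{2}}
  = \frac{n}{n^{1/3}\varepsilon^{-2/3}\,\varepsilon^{2}}
  = \frac{n^{2/3}}{\varepsilon^{4/3}} = m_1, \\
m_1 = n
  &\;\Longrightarrow\;
  \frac{n}{\sqrt{m_1}\,\varepsilon^{2}}
  = \frac{\sqrt{n}}{\varepsilon^{2}}, \\
m_1 = n^{2/3+\gamma},\ \varepsilon = \Theta(1)
  &\;\Longrightarrow\;
  \frac{n}{\sqrt{m_1}} = n^{1-(1/3+\gamma/2)} = n^{2/3-\gamma/2}.
\end{align*}
```

The first line recovers the equal-sample-size bound, the second recovers the one-unknown-distribution bound, and the third matches the exponents quoted informally in the introduction.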

The lower bound in the above theorem is proved using the machinery developed in Valiant [30], and
“interpolates” between the $\Theta(\sqrt{n}/\varepsilon^2)$ lower bound in the one-unknown-distribution
setting of testing uniformity [22] and the $\Theta(n^{2/3}/\varepsilon^{4/3})$ lower bound in the setting of
equal sample sizes from two unknown distributions [9]. The upper bound is proved in several steps. We begin by
proposing two algorithms for the hypothesis testing problem $p = q$ versus $\|p - q\|_1 > \varepsilon$ depending
on the value of $m_1$: the non-extreme regime, that is, $m_1 = O((n/\varepsilon^2)^{1-\gamma})$, and the extreme
case where $m_1 = \Theta(n)$. In the non-extreme regime, our algorithm is an extension of the algorithm proposed
in [9], and is similar to the algorithm proposed in [3] except with the addition of a normalizing term. In the
extreme regime when $m_1 \approx n$, we incorporate an additional statistic that has not appeared before in the
literature.²

2 We note that a further extension of this algorithm yields a stronger robustness parameter, distinguishing between |\p — q\ |i < 

°( max (v4 T’/k)) versus ||p-g||r > e. 
As an application of Theorem 1 in the extreme regime when $m_1 = \Theta(n)$, we obtain an improved
algorithm for estimating the mixing time of a Markov chain:

Corollary 1. Consider a finite Markov chain with state space $[n]$ and a next node oracle; there is an
algorithm that estimates the mixing time, $\tau_{mix}$, up to a multiplicative factor of $\log n$, that uses
$\tilde{O}(n^{3/2}\tau_{mix})$ time and queries to the next node oracle.

It remains an intriguing open question whether this query complexity is optimal; we are not aware of
any lower bounds beyond the trivial $\Omega(n\tau_{mix})$.

1.3 Outline 

We begin by stating our testing algorithms, and describe both the intuition behind the algorithms and
the high level proof approach. Throughout the theoretical portion of the paper, we will work in the
“Poissonized” setting, where we assume that we have access to Pois($m_1$) samples from distribution $p$,
and Pois($m_2$) samples drawn from distribution $q$. This assumption that the sample size is a random variable
renders the number of occurrences of different domain elements independent. Because Pois($\lambda$) is tightly
concentrated about its expectation, both the upper and lower bounds on the sample complexities proved in
this “Poissonized” setting also hold (up to factors of $1 \pm o(1)$) in the setting in which one obtains
samples of a fixed size.
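
The Poissonized sampling model can be illustrated with a short simulation. The sketch below is only an illustration, not part of the paper's analysis: the helper names `poisson` and `poissonized_counts` are hypothetical, and the Poisson sampler (counting rate-1 exponential arrivals) is one simple choice among many. It draws Pois($m$) samples from a small distribution and checks empirically that the count of element $i$ has mean close to $m p_i$.

```python
import random
from collections import Counter


def poisson(lam, rng):
    """Draw from Pois(lam) by counting rate-1 exponential arrivals in [0, lam]."""
    count = 0
    t = rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def poissonized_counts(probs, m, rng):
    """Draw Pois(m) i.i.d. samples from `probs`; return per-element counts.

    Under this model the count of element i is distributed as
    Pois(m * probs[i]), independently across elements.
    """
    size = poisson(m, rng)
    draws = rng.choices(range(len(probs)), weights=probs, k=size)
    tally = Counter(draws)
    return [tally[i] for i in range(len(probs))]


rng = random.Random(0)
probs = [0.5, 0.3, 0.2]
trials = [poissonized_counts(probs, 40, rng) for _ in range(2000)]
means = [sum(t[i] for t in trials) / len(trials) for i in range(3)]
print(means)  # each entry should be close to 40 * probs[i]
```

With a fixed sample size the counts would instead be multinomial, and hence weakly dependent; the random sample size is what removes that dependence.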

The complete proofs require rather involved calculations of the moments of the various statistics
employed by our algorithms, and are deferred to Appendix A. The application of our testing results to the
problem of testing or estimating the mixing time of a Markov chain is discussed in Section 3. Finally,
Section 4 contains some empirical results, suggesting that the statistic at the core of our algorithms performs
very well in practice. This section contains both results on synthetic data and an illustration of how to
apply these ideas to the problem of estimating some notion of the semantic similarity of two words based on
samples of the n-grams that contain the words in a corpus of text. The construction and proof of our lower 
bounds, showing the optimality of our testing algorithms is given in Appendix [Pj 

2 Algorithms for Testing 

In this section we describe algorithms for $\ell_1$ testing with unequal samples, which give the upper bound in
Theorem 1. We propose two algorithms depending on the value of $m_1$: the non-extreme regime, that is,
$m_1 = O((n/\varepsilon^2)^{1-\gamma})$, and the extreme case where $m_1 \approx n$.

2.1 Algorithms for $\ell_1$ Testing: Non-Extreme Case

We begin with the basic algorithm (Algorithm 1), which is optimal in the non-extreme regime, for constant
$\varepsilon$. All the subsequent algorithms are modifications of this basic algorithm.

Algorithm 1 Closeness Testing: Non-Extreme Case (The Basic Algorithm)

Suppose $\varepsilon = \Omega(1)$ and $m_1 = O(n^{1-\gamma})$ for some $\gamma > 0$. Let $S_1, S_2$ denote two
independent sets of Pois($m_1$) samples from $p$ and let $T_1, T_2$ denote two independent sets of Pois($m_2$)
samples drawn from $q$. We wish to test $p = q$ versus $\|p - q\|_1 > \varepsilon$.

• Let 6 = 2 l )a 1 7 °^ n , and define the set B = {i G [n] : > b} U {i G [n] : > b}, where Xf 1 

denotes the number of occurrences of $i$ in $S_1$, and $Y_i^{T_1}$ denotes the number of occurrences of $i$ in $T_1$.

• Let $X_i$ denote the number of occurrences of element $i$ in $S_2$, and $Y_i$ denote the number of
occurrences of element $i$ in $T_2$:


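1. Check if

$$\sum_{i \in B} \left| \frac{X_i}{m_1} - \frac{Y_i}{m_2} \right| < \varepsilon/6. \qquad (2)$$

2. Check if

$$\sum_{i \in [n] \setminus B} \frac{(m_2 X_i - m_1 Y_i)^2 - (m_2^2 X_i + m_1^2 Y_i)}{X_i + Y_i} < C_\gamma\, m_1^{3/2} m_2, \qquad (3)$$

for an appropriately chosen constant $C_\gamma$ (depending on $\gamma$).

3. If (2) and (3) hold, then ACCEPT. Otherwise, REJECT.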
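
The structure of the basic algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `basic_closeness_accepts` is hypothetical, the threshold `b` and the constant `c_gamma` are left as caller-supplied parameters, and elements with $X_i + Y_i = 0$ are skipped since their numerator is zero.

```python
def basic_closeness_accepts(x1, y1, x2, y2, m1, m2, eps, b, c_gamma):
    """Return True (ACCEPT) if both checks of the basic algorithm pass.

    x1, y1: per-element counts from the first batches S1, T1 (partition only).
    x2, y2: per-element counts from the second batches S2, T2 (statistics).
    b, c_gamma: heavy-element threshold and constant; both are assumptions
    supplied by the caller, not values taken from the paper.
    """
    n = len(x1)
    heavy = {i for i in range(n) if x1[i] / m1 > b or y1[i] / m2 > b}

    # Check (2): empirical l1 distance restricted to the heavy elements.
    heavy_dist = sum(abs(x2[i] / m1 - y2[i] / m2) for i in heavy)

    # Check (3): normalized statistic summed over the light elements.
    light_stat = 0.0
    for i in range(n):
        if i in heavy or x2[i] + y2[i] == 0:
            continue
        num = (m2 * x2[i] - m1 * y2[i]) ** 2 - (m2 ** 2 * x2[i] + m1 ** 2 * y2[i])
        light_stat += num / (x2[i] + y2[i])

    return heavy_dist < eps / 6 and light_stat < c_gamma * m1 ** 1.5 * m2
```

Keeping the partition counts (`x1`, `y1`) separate from the statistic counts (`x2`, `y2`) mirrors the two independent batches of samples used in the algorithm statement.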


The intuition behind the above algorithm is as follows: with high probability, all elements in the set $B$
satisfy either $p_i > b/2$, or $q_i > b/2$ (or both). Given that these elements are “heavy”, their contribution
to the $\ell_1$ distance will be accurately captured by the $\ell_1$ distance of their empirical frequencies
(where these empirical frequencies are based on the second set of samples, $S_2, T_2$). For the elements that
are not in set $B$ (the “light” elements) we use a modification of the statistic used by Chan et al. [9], where
the terms are re-weighted according to the unequal sample sizes. This is similar to the algorithm proposed in
[3], where instead of (3) the authors used the numerator of (3) to distinguish the light elements. However,
just using the numerator only gives an estimate of the $\ell_2$ distance between $p$ and $q$. The normalization
by $X_i + Y_i$ in (3) “linearizes” the statistic, which gives some estimate of the $\ell_1$ distance between the
two distributions for the light elements. Similar results can possibly be obtained by using other linear
functions of $X_i$ and $Y_i$ in the denominator, though we note that the “obvious” normalizing factor of
$X_i + \frac{m_1}{m_2} Y_i$ does not seem to work theoretically, and seems to have extremely poor performance in
practice. Additionally, the unweighted $X_i + Y_i$ normalization is easier to analyze.
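
To see why the numerator of the light-element statistic tracks the squared $\ell_2$ distance, one can compute its expectation for a single element $i$ in the Poissonized setting, where $X_i \sim \mathrm{Pois}(m_1 p_i)$ and $Y_i \sim \mathrm{Pois}(m_2 q_i)$ are independent. The derivation below is a sketch of this standard calculation, not a reproduction of the analysis in the appendix.

```latex
\begin{align*}
\mathbb{E}\big[(m_2 X_i - m_1 Y_i)^2\big]
  &= \mathrm{Var}(m_2 X_i - m_1 Y_i) + \big(\mathbb{E}[m_2 X_i - m_1 Y_i]\big)^2 \\
  &= m_2^2 m_1 p_i + m_1^2 m_2 q_i + m_1^2 m_2^2 (p_i - q_i)^2, \\
\mathbb{E}\big[m_2^2 X_i + m_1^2 Y_i\big]
  &= m_2^2 m_1 p_i + m_1^2 m_2 q_i, \\
\mathbb{E}\big[(m_2 X_i - m_1 Y_i)^2 - (m_2^2 X_i + m_1^2 Y_i)\big]
  &= m_1^2 m_2^2 (p_i - q_i)^2 .
\end{align*}
```

Summing over $i$ gives $m_1^2 m_2^2 \|p - q\|_2^2$, which vanishes exactly when $p = q$; the subtracted term removes the variance contribution that would otherwise bias the squared difference.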

Finally, we should emphasize that the crude step of using two independent batches of samples (the first
to obtain the partition of the domain into “heavy” and “light” elements, and the second to actually compute
the statistics) is for ease of analysis. As our empirical results of Section 4 suggest, for practical
applications one might want to use only the $Z$-statistic of (3), and one certainly should not “waste” half the
samples to perform the “heavy”/“light” partition.
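
For practical use, the statistic alone is straightforward to compute from two vectors of counts. The function below is a minimal sketch: the name `z_statistic` is hypothetical, and dividing by $m_1^{3/2} m_2$ is an assumed normalization chosen so that the acceptance threshold becomes a constant. Elements that appear in neither sample are skipped, since their numerator is zero.

```python
def z_statistic(x, y, m1, m2):
    """Normalized Z statistic from per-element counts.

    x[i], y[i]: occurrences of element i in samples of size about m1 and m2.
    """
    total = 0.0
    for xi, yi in zip(x, y):
        if xi + yi == 0:
            continue  # unseen elements contribute nothing
        num = (m2 * xi - m1 * yi) ** 2 - (m2 ** 2 * xi + m1 ** 2 * yi)
        total += num / (xi + yi)
    return total / (m1 ** 1.5 * m2)


print(z_statistic([2, 0], [0, 1], m1=4, m2=2))  # 0.25
```

In the example, the first element contributes $(16 - 8)/2 = 4$ and the second contributes $0$, so the normalized value is $4 / (4^{3/2} \cdot 2) = 0.25$.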

To get the optimal dependence on $\varepsilon$, the above algorithm needs to be slightly modified. Algorithm 2
gives the optimal sample complexity in the non-extreme case, for any $\varepsilon > n^{-1/12}$. We state the
algorithm here, as the algorithm in the extreme case where $m_1 \approx n$ and $m_2 \approx \sqrt{n}$ leverages
some of its components. The analysis of the algorithm and the proof of the following proposition are given in
Appendix B.

Algorithm 2 Asymmetric Closeness Testing: Non-Extreme Case

Suppose $m_1 = O((n/\varepsilon^2)^{1-\gamma}) < n$ for some $\gamma > 0$. Let $S_1, S_2$ denote two independent
sets of Pois($m_1$) samples from $p$ and let $T_1, T_2$ denote two independent sets of Pois($m_2$) samples drawn
from $q$. We wish to test $p = q$ versus $\|p - q\|_1 > \varepsilon$.

• Let b = 257105 n , and // = 256 log "■ an( j ] c t xf 1 denote the number of occurrences of i in Si, and 

£^7712 7712 9 l 

V' 7 ' denote the number of occu rrences of i in T\. 

X S 1 y^l 

• Define the “heavy” set B = {i e [n] : > b} U {i G [n] : ^ > 6}. 

f X ^ y Ti 

• Define the “medium” set M = < i G [n] : 2 / < max{^-, 777} 5 ; b 

• Define the “light” set H = [to] \ (B IJ M). 

• Let Xi denote the number of occurrences of element i in S2, and Y denote the number of 
occurrences of element i in T2: 


1. Check if 


2. Check if 


Vi,:=XV<:=E 

ieB i£B 


mi m 2 


< e/ 6 . 


W M ■■= ~ miY ' f + mfYi] < 


2 , £ 2 , m{m 2 \ogn 


i&M i&M 


3. Check if 


Z H :=Y, Z --'=Y 

i£H ieH 


(m 2 Xi - TOili ) 2 - {m\Xi + m\Yi) 3/2 

X-+ Y- < C 7 to 1 ni 2 - ■ 


where $C_\gamma$ is an appropriately chosen absolute constant, dependent on $\gamma$.

4. If (4), (5), and (6) hold, then ACCEPT. Otherwise, REJECT.


(4) 


(5) 


( 6 ) 


Proposition 1. Suppose $m_1 = O((n/\varepsilon^2)^{1-\gamma}) < n$ for some $\gamma > 0$, and
$\varepsilon > n^{-1/12}$. Then Algorithm 2 takes $\Theta(m_1)$ samples from $p$ and
$O(\max\{\frac{n}{\sqrt{m_1}\varepsilon^2}, \frac{\sqrt{n}}{\varepsilon^2}\})$ samples from $q$, and with
probability at least 2/3 distinguishes whether $p = q$ versus $\|p - q\|_1 > \varepsilon$.

2.2 Algorithm for $\ell_1$ Testing: Extreme Case

For the extreme case, $m_1 \approx n$ and $m_2 \approx \sqrt{n}$, the re-weighted statistic $Z_H$ might have
large variance, necessitating a modification to the algorithm in this extreme case. To see the cause of such
variance, consider the case where the samples are drawn from the uniform distribution, Unif$[n]$. By the
birthday paradox, we might see a constant number of indices $i$ for which $Y_i = 2$, but $X_i = 0$. Such domain
elements themselves contribute $\Theta(n^4)$ to the variance of $Z_H$, which is at the threshold of what can be
tolerated. The statistic (7) introduced below is tailored to deal with these cases, and captures the intuition
that we are more tolerant of indices $i$ for which $Y_i = 2$ if the corresponding $X_i$ is larger.

These modifications allow us to solve the closeness testing problem in the extreme case. In fact, the
following algorithm works whenever $m_1 = \Omega((n/\varepsilon^2)^{8/9+\gamma})$, overlapping with the
non-extreme case for $\gamma \in (0, 1/9)$.


Algorithm 3 Asymmetric Closeness Testing: Extreme Case

Suppose $m_1 = \Omega((n/\varepsilon^2)^{8/9+\gamma})$ for some $\gamma > 0$. Let $S_1, S_2$ denote two
independent sets of Pois($m_1$) samples from $p$ and let $T_1, T_2$ denote two independent sets of Pois($m_2$)
samples drawn from $q$. We wish to test $p = q$ versus $\|p - q\|_1 > \varepsilon$.


• Define $b, b', B, M, H$ as in Algorithm 2.

• Let $X_i$ denote the number of occurrences of element $i$ in $S_2$, and $Y_i$ denote the number of
occurrences of element $i$ in $T_2$:


1. REJECT if there exists $i \in [n]$ such that $Y_i \ge 3$ and
$X_i \le \frac{m_1 \varepsilon^{2/3}}{10\, m_2\, n^{1/3}}$.

2. Check if

$$R_H := \sum_{i \in H} \frac{\mathbb{1}\{Y_i = 2\}}{X_i + 1} \le C_1 \frac{m_2^2}{m_1}, \qquad (7)$$

where $C_1$ is an appropriately chosen absolute constant.

3. If step 1 does not reject and (4), (5), (6), and (7) are satisfied, then ACCEPT. Otherwise, REJECT.


Proposition 2 below summarizes the performance of the above algorithm. The proof is given in
Appendix C.

Proposition 2. Suppose $m_1 = \Omega((n/\varepsilon^2)^{8/9+\gamma})$ for some $\gamma > 0$ and
$\varepsilon > n^{-1/12}$. Then Algorithm 3 takes $\Theta(m_1)$ samples from $p$ and
$O(\max\{\frac{n}{\sqrt{m_1}\varepsilon^2}, \frac{\sqrt{n}}{\varepsilon^2}\})$ samples from $q$, and with
probability at least 2/3 distinguishes whether $p = q$ versus $\|p - q\|_1 > \varepsilon$.

It is worth noting that one can also define a natural analog of the $R_H$ statistic corresponding to the
indices $i$ for which $Y_i = 3$, etc., and that the use of such statistics improves the robustness parameter
of the test.


3 Estimating Mixing in Markov Chains 

Consider a finite Markov chain with state space $[n]$, transition matrix $P = ((P(x, y)))$, and stationary
distribution $\pi$. The $t$-step distribution starting at the point $x \in [n]$, $P_x^t(\cdot)$, is the
probability distribution on $[n]$ obtained by running the chain for $t$ steps starting from $x$. More formally,
for $A \subseteq [n]$, $P_x^t(A) = \Pr[X_t \in A \mid X_0 = x]$, where $(X_0, X_1, \ldots, X_t)$ are the steps of
the chain. The $t$-step distribution $P_x^t$ can be computed as a vector matrix product $e_x P^t$, where
$e_x \in \mathbb{R}^n$ is the standard basis vector which has 1 at position $x$ and zeros everywhere else.








Definition 1. The $\varepsilon$-mixing time of a Markov chain with transition matrix $P = ((P(x, y)))$ is defined as
$t_{mix}(\varepsilon) := \inf\{ t \in [n] : \sup_{x \in [n]} \frac{1}{2} \sum_{y \in [n]} |P_x^t(y) - \pi(y)| \le \varepsilon \}$.

Definition 2. The average $t$-step distribution of a Markov chain $P$ with $n$ states is the distribution
$\bar{P}^t(A) = \frac{1}{n} \sum_{x \in [n]} P_x^t(A)$, that is, the distribution obtained by choosing $x$
uniformly from $[n]$ and walking $t$ steps from the state $x$.
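
These definitions can be made concrete on a tiny chain whose transition matrix is known explicitly. The sketch below is an illustration only: the helper names and the example chain (a lazy walk on a 3-cycle) are hypothetical, and the testing algorithms themselves never form the matrix $P$, relying instead on samples from an oracle.

```python
def step(dist, P):
    """One step of the chain: row vector `dist` times transition matrix `P`."""
    n = len(P)
    return [sum(dist[x] * P[x][y] for x in range(n)) for y in range(n)]


def t_step_distribution(P, x, t):
    """Distribution after t steps started from state x, i.e. e_x P^t."""
    dist = [1.0 if s == x else 0.0 for s in range(len(P))]
    for _ in range(t):
        dist = step(dist, P)
    return dist


def average_t_step_distribution(P, t):
    """Average of the t-step distributions over a uniformly random start."""
    n = len(P)
    rows = [t_step_distribution(P, x, t) for x in range(n)]
    return [sum(row[y] for row in rows) / n for y in range(n)]


def l1_distance(u, v):
    """l1 distance between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Lazy walk on a 3-cycle (hypothetical example); its stationary law is uniform.
P = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
uniform = [1 / 3] * 3
for t in (1, 2, 4):
    worst = max(l1_distance(t_step_distribution(P, x, t), uniform) for x in range(3))
    print(t, worst)
```

For this chain the worst-case $\ell_1$ distance to the stationary distribution shrinks by a factor of 4 with every step, while the average $t$-step distribution is uniform for every $t$.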

As observed by Batu et al. [8], $\ell_1$ closeness testing can be used to test whether a Markov chain is close
to mixing after some specified number of steps, $t_0$. Here, we note that asymmetric closeness testing (as
opposed to the case of equal sized samples as employed in [8]) yields an improvement in the performance
of the testing algorithm for Markov chain mixing.

The algorithm to test mixing proposed by Batu et al. [8] involves testing the $\ell_1$ difference between
distributions $P_x^{t_0}$ and $\bar{P}^{t_0}$, for every $x \in [n]$. The algorithm uses their $\ell_1$ distance
test, which draws $\tilde{O}(n^{2/3})$ samples from both the distributions $P_x^{t_0}$ and $\bar{P}^{t_0}$, and
has an overall running time of $\tilde{O}(n^{5/3} t_0)$. However, the distribution $\bar{P}^{t_0}$ does not
depend on the starting state $x$, and using Algorithm 3 it suffices to take $\tilde{O}(n)$ samples from
$\bar{P}^{t_0}$ once and $\tilde{O}(\sqrt{n})$ samples from $P_x^{t_0}$, for every $x \in [n]$. This results in a
query and runtime complexity of $\tilde{O}(n^{3/2} t_0)$.


Algorithm 4 Testing for Mixing Times in Markov Chains

Given $t_0 \in \mathbb{N}$ and a finite Markov chain with state space $[n]$ and transition matrix
$P = ((P(x, y)))$, we wish to test

$$H_0 : t_{mix}\left(O\left(\tfrac{1}{\sqrt{n}}\right)\right) \le t_0, \quad \text{versus} \quad H_1 : t_{mix}(\varepsilon) > t_0. \qquad (8)$$

1. Draw $O(\log n)$ samples $S_1, \ldots, S_{O(\log n)}$, each of size Pois($C_1 n$), from the average
$t_0$-step distribution.

2. For each state $x \in [n]$ we will distinguish whether
$\|P_x^{t_0} - \bar{P}^{t_0}\|_1 \le O(\frac{1}{\sqrt{n}})$, versus $\|P_x^{t_0} - \bar{P}^{t_0}\|_1 > \varepsilon$,
with probability of error $\ll 1/n$. We do this by running $O(\log n)$ runs of Algorithm 3, with the $i$-th
run using $S_i$ and a fresh set of Pois($O(\varepsilon^{-2}\sqrt{n})$) samples from $P_x^{t_0}$.

3. If all $n$ of the $\ell_1$ closeness testing problems are accepted, then we ACCEPT $H_0$.
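
The control flow of this procedure can be sketched as follows. This is an illustrative outline under several assumptions, not the authors' implementation: `mixing_test_accepts` and `walk` are hypothetical names, `closeness_test` is a caller-supplied stand-in for the asymmetric closeness tester, fixed sample sizes replace the Poisson sizes for brevity, and aggregating the repeated runs by majority vote is an assumed rule.

```python
import random


def walk(next_node, x, t):
    """Follow the next-node oracle for t steps starting from state x."""
    for _ in range(t):
        x = next_node(x)
    return x


def mixing_test_accepts(next_node, n, t0, closeness_test, runs, big, small, rng):
    """Sketch of the mixing test; returns True to ACCEPT H0.

    closeness_test(sample_avg, sample_x) -> bool is supplied by the caller.
    `runs`, `big` and `small` play the roles of the O(log n), C1 * n and
    O(sqrt(n) / eps^2) quantities; their constants are left to the caller.
    """
    # Step 1: `runs` large samples from the average t0-step distribution,
    # reused for every start state.
    avg_samples = [
        [walk(next_node, rng.randrange(n), t0) for _ in range(big)]
        for _ in range(runs)
    ]
    # Step 2: for each start state, pair each large sample with a fresh small one.
    for x in range(n):
        votes = 0
        for sample_avg in avg_samples:
            sample_x = [walk(next_node, x, t0) for _ in range(small)]
            votes += bool(closeness_test(sample_avg, sample_x))
        if 2 * votes <= runs:  # majority vote (assumed aggregation rule)
            return False
    # Step 3: every per-state test accepted.
    return True
```

When every state is accepted, the number of oracle queries is `runs * big * t0` for the shared samples plus `n * runs * small * t0` for the per-state samples, which is where the saving over drawing a large sample for every state comes from.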


The above testing algorithm can be leveraged to estimate the mixing time of a Markov chain, via the
basic observation that if $t_{mix}(1/4) \le t_0$, then for any $\varepsilon$,
$t_{mix}(\varepsilon) \le \frac{\log \varepsilon}{\log 1/2}\, t_0$, and thus
$t_{mix}(1/\sqrt{n}) \le 2 \log n \cdot t_{mix}(1/4)$. Because $t_{mix}(1/4)$ and $t_{mix}(O(1/\sqrt{n}))$ differ
by at most a factor of $\log n$, by applying Algorithm 4 for a geometrically increasing sequence of $t_0$'s,
and repeating each test $O(\log t_0 + \log n)$ times, one obtains Corollary 1.

4 Empirical Results 

Both our formal algorithms and the corresponding theorems involve some unwieldy constant factors (that 
can likely be reduced significantly). Nevertheless, in this section we provide some evidence that the statistic 
at the core of our algorithms can be fruitfully used in practice, even for surprisingly small sample sizes. 
4.1 Testing similarity of words

An extremely important primitive in natural language processing is the ability to estimate the semantic
similarity of two words. Here, we show that the $Z$ statistic,

$$Z = \sum_i \frac{(m_2 X_i - m_1 Y_i)^2 - (m_2^2 X_i + m_1^2 Y_i)}{m_1^{3/2} m_2 (X_i + Y_i)},$$

which is the core of our testing algorithms, can accurately distinguish whether two words are very similar
based on surprisingly small samples of the contexts in which they occur. Specifically, for each pair of words,
$a, b$ that we consider, we select $m_1$ random occurrences of $a$ and $m_2$ random occurrences of word $b$ from
the Google books corpus, using the Google Books Ngram Dataset.³ We then compare the sample of words that follow
$a$ with the sample of words that follow $b$. Henceforth, we refer to these as samples of the set of bi-grams
involving each word, although for convenience, we only considered the bi-grams whose first word was the
word in question.

Figure 1 illustrates the $Z$ statistic for various pairs of words that range from rather similar words like
“smart” and “intelligent”, to essentially identical word pairs such as “grey” and “gray” (whose usage differs
mainly as a result of historical variation in the preference for one spelling over the other); the sample size
of bi-grams containing the first word is fixed at $m_1 = 1{,}000$, and the sample size corresponding to the
second word varies from $m_2 = 50$ through $m_2 = 1{,}000$. To provide a frame of reference, we also compute the
value of the statistic for independent samples corresponding to the same word (i.e. two different samples of 
words that follow “wolf”); these are depicted in red. For comparison, we also plot the total variation distance 
between the empirical distributions of the pair of samples, which does not clearly differentiate between pairs 
of identical words, versus different words, particularly for the smaller sample sizes. 

One subtle point is that the issue with using the empirical distance between the distributions goes beyond 
simply not having a consistent reference point. For example, let X denote a large sample of size mi from 
distribution p, X' denote a small sample of size m 2 from p, and Y denote a small sample of size m 2 from a 
different distribution q. It might be tempting to hope that the empirical distance between X and X' will be 
smaller than the empirical distance between $X$ and $Y$. As Figure 2 illustrates, this is not always the case,
even for natural distributions: for this specific example, over much of the range of $m_2$, the empirical
distance between $X$ and $X'$ is indistinguishable from that of $X$ and $Y$, and yet, as our statistic easily
discerns, these distributions are very different.

This point is further emphasized in Figure 3, which depicts this phenomenon in the synthetic setting where
$p = \mathrm{Unif}[5{,}000]$ is the uniform distribution over 5,000 elements, and $q$ is the distribution whose
elements have probabilities $(1 \pm \varepsilon)/5000$, for $\varepsilon = 1/4$. The right plot represents the
empirical probability that the distance between two empirical distributions of the samples from $p$ is larger
than the distance between the empirical distributions of the samples from $p$ and $q$; the left plot represents
the analogous probability involving the $Z$ statistic. In both plots, $m_1$ ranges between $n^{2/3}$ and $n$,
and $m_2$ ranges between $n^{1/2}$ and $n$, for $n = 5{,}000$.


The Google Books Ngram Dataset is freely available here:
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

[Figure 1 plots. Title: “Similarity Between Pairs of Words”. Left panel: “Our Z statistic”. Right panel: “Empirical Distance”.]




Figure 1: Two measures of the similarity between words, based on samples of the bi-grams containing each
word. Each line represents a pair of words, and is obtained by taking a sample of $m_1 = 1,000$ bi-grams
containing the first word, and $m_2 = 50, \ldots, 1,000$ bi-grams containing the second word, where $m_2$ is
depicted along the x-axis in logarithmic scale. In both plots, the red lines represent pairs of identical words
(e.g. “wolf/wolf”, “almost/almost”, ...). The blue lines represent pairs of similar words (e.g. “wolf/fox”,
“almost/nearly”, ...), and the black line represents the pair “grey/gray”, whose distributions of bi-grams differ
because of historical variations in preference for each spelling. Solid lines indicate the average over 200
trials for each word pair and choice of $m_2$, with error bars of one standard deviation depicted. The left plot
depicts our statistic, which clearly distinguishes identical words, and demonstrates some intuitive sense of
semantic distance. The right plot depicts the total variation distance between the empirical distributions,
which does not successfully distinguish the identical words, given the range of sample sizes considered. The
plot would not be significantly different if other distance metrics between the empirical distributions, such as
f-divergence, were used in place of total variation distance. Finally, note the extremely uniform magnitudes
of the error bars in the left plot, as $m_2$ increases, which is a result of the $X_i + Y_i$ normalization term in the
Z statistic.

[Figure 2 plots. Title: “Similarity Between Pairs of Words”. Left panel: “Our Z statistic”. Right panel: “Empirical Distance”.]




Figure 2: Illustration of how the empirical distance can be misleading: here, the empirical distance between
the distributions of samples of bi-grams for “wolf/wolf” is indistinguishable from that for the pair
“wolf/fox*” over much of the range of $m_2$; nevertheless, our statistic clearly discerns that these are
significantly different distributions. Here, “fox*” denotes the distribution of bi-grams whose first word is “fox”,
restricted to only the most common 100 bi-grams. As in Figure 1, $m_1 = 1,000$, and $m_2$ ranges from 50 to
1,000, with solid lines depicting the average of 200 trials, and error bars depicting one standard deviation.

Figure 3: A comparison of the Z statistic versus the empirical distribution for distinguishing whether two
samples of respective sizes $m_1, m_2$ were both drawn from distribution $p := \mathrm{Unif}[5,000]$, versus one sample
being drawn from $p$ and the other drawn from a distribution $q$ in which domain elements have probability
$(1 \pm \epsilon)/5000$, for $\epsilon = 1/4$, and hence $\|p - q\|_1 = 1/4$. The color signifies the fraction of 120 repetitions
for which the statistic correctly distinguishes these cases, as $m_1$ varies between $n^{2/3}$ and $n$, and $m_2$ varies
between $n^{1/2}$ and $n$.
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A Expectation and Variance Bounds 


Before beginning the analysis of the algorithms we need bounds on the expectation and variance of the
different statistics used in the algorithms. Throughout this section, fix any set $A \subseteq [n]$, and let $X_i$ denote the
number of occurrences of the i-th domain element in set $S_2$, a set of $\mathrm{Pois}(m_1)$ samples from distribution
$p$, and analogously let $Y_i$ denote the number of occurrences of the i-th domain element in set $T_2$, a set of
$\mathrm{Pois}(m_2)$ samples from distribution $q$. Throughout this section, we bound the moments of the following
statistics:

• $V_A = \sum_{i \in A} V_i = \sum_{i \in A} \left| \frac{X_i}{m_1} - \frac{Y_i}{m_2} \right|$

• $W_A = \sum_{i \in A} W_i = \sum_{i \in A} \left( (m_2 X_i - m_1 Y_i)^2 - (m_2^2 X_i + m_1^2 Y_i) \right)$

• $Z_A = \sum_{i \in A} Z_i = \sum_{i \in A} \frac{(m_2 X_i - m_1 Y_i)^2 - (m_2^2 X_i + m_1^2 Y_i)}{X_i + Y_i}$
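
The three statistics can be computed directly from the per-element counts. The following is a minimal sketch, not the authors' implementation; the function name `closeness_statistics` is hypothetical, and it assumes that an element with $X_i + Y_i = 0$ contributes zero to $Z_A$.

```python
from typing import Sequence, Tuple


def closeness_statistics(
    x: Sequence[int], y: Sequence[int], m1: float, m2: float
) -> Tuple[float, float, float]:
    """Return (V_A, W_A, Z_A) for counts x (sample from p) and y (sample from q).

    x[i] and y[i] are the numbers of occurrences of the i-th element of A.
    """
    v = w = z = 0.0
    for xi, yi in zip(x, y):
        # V_i = |X_i / m1 - Y_i / m2|
        v += abs(xi / m1 - yi / m2)
        # Shared numerator of W_i and Z_i.
        numerator = (m2 * xi - m1 * yi) ** 2 - (m2 ** 2 * xi + m1 ** 2 * yi)
        w += numerator
        # Assumption: unseen elements (X_i + Y_i = 0) contribute 0 to Z_A.
        if xi + yi > 0:
            z += numerator / (xi + yi)
    return v, w, z


if __name__ == "__main__":
    print(closeness_statistics([3, 0, 1], [1, 2, 0], m1=4, m2=3))
```

Note that $W_A$ and $Z_A$ share the same numerator; only the per-element normalization by $X_i + Y_i$ differs.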

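A.1 Expectation and Variance of $V_A$
Lemma 1. For any fixed set $A \subseteq [n]$,

$$\sum_{i \in A} |p_i - q_i| \le \mathbb{E}[V_A] \le \sum_{i \in A} |p_i - q_i| + \sum_{i \in A} \left( \frac{p_i}{m_1} + \frac{q_i}{m_2} \right)^{1/2} \le \sum_{i \in A} |p_i - q_i| + \left( \frac{|A|}{m_1} + \frac{|A|}{m_2} \right)^{1/2}, \tag{9}$$

and

$$\mathrm{Var}[V_A] \le \frac{1}{m_1} + \frac{1}{m_2}. \tag{10}$$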

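Proof. For the lower bound on the expectation, note that
$\mathbb{E}\left[ \left| \frac{X_i}{m_1} - \frac{Y_i}{m_2} \right| \right] \ge \left| \mathbb{E}\left[ \frac{X_i}{m_1} - \frac{Y_i}{m_2} \right] \right| = |p_i - q_i|$.
To prove the upper bound, observe that

$$\mathbb{E}[V_i^2] = \frac{p_i}{m_1} + \frac{q_i}{m_2} + (p_i - q_i)^2.$$

By the Cauchy-Schwarz inequality,

$$\mathbb{E}\left[ \sum_{i \in A} V_i \right] \le \sum_{i \in A} \sqrt{\mathbb{E}[V_i^2]} \le \sum_{i \in A} |p_i - q_i| + \sum_{i \in A} \left( \frac{p_i}{m_1} + \frac{q_i}{m_2} \right)^{1/2} \le \sum_{i \in A} |p_i - q_i| + \left( \frac{|A|}{m_1} + \frac{|A|}{m_2} \right)^{1/2}.$$

Finally,

$$\mathrm{Var}[V_A] = \sum_{i \in A} \left( \mathbb{E}[V_i^2] - \mathbb{E}[V_i]^2 \right) \le \frac{\sum_{i \in A} p_i}{m_1} + \frac{\sum_{i \in A} q_i}{m_2} \le \frac{1}{m_1} + \frac{1}{m_2}. \tag{11}$$

□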


A.2 Expectation and Variance of $W_A$

For $A \subseteq [n]$, define $W_A = \sum_{i \in A} W_i = \sum_{i \in A} \left( (m_2 X_i - m_1 Y_i)^2 - (m_2^2 X_i + m_1^2 Y_i) \right)$. Using the facts that
$X_i \sim \mathrm{Pois}(m_1 p_i)$ and $Y_i \sim \mathrm{Pois}(m_2 q_i)$ and plugging in the expressions for the moments of Poissons, the
following lemma follows immediately:

Lemma 2. For any $A \subseteq [n]$, $W_A / (m_1^2 m_2^2)$ is an unbiased estimate of $\|p_A - q_A\|_2^2$. Namely,

$$\mathbb{E}[W_A] = m_1^2 m_2^2 \sum_{i \in A} (p_i - q_i)^2. \tag{12}$$

Moreover,

$$\mathrm{Var}[W_A] = 2 m_1^2 m_2^2 \sum_{i \in A} z_i^2 + 4 m_1^3 m_2^3 \sum_{i \in A} z_i (p_i - q_i)^2, \tag{13}$$

where $z_i = m_2 p_i + m_1 q_i$.
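
As a sanity check on Lemma 2, the sketch below evaluates the mean and variance of a single $W_i$ by summing over a truncated Poisson support and compares them with the closed forms $m_1^2 m_2^2 (p_i - q_i)^2$ and $2 m_1^2 m_2^2 z_i^2 + 4 m_1^3 m_2^3 z_i (p_i - q_i)^2$. The helper names and the truncation point `kmax` are illustrative assumptions, and the parameter values are arbitrary.

```python
import math
from typing import List, Tuple


def poisson_pmf(lam: float, kmax: int) -> List[float]:
    """Pr[Pois(lam) = k] for k = 0..kmax, via the recurrence p_k = p_{k-1} * lam / k."""
    pmf = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        pmf.append(pmf[-1] * lam / k)
    return pmf


def w_moments_numeric(m1, m2, p, q, kmax: int = 80) -> Tuple[float, float]:
    """Mean and variance of W_i from a truncated double sum over (X_i, Y_i)."""
    px = poisson_pmf(m1 * p, kmax)
    py = poisson_pmf(m2 * q, kmax)
    mean = second = 0.0
    for x, wx in enumerate(px):
        for y, wy in enumerate(py):
            w = (m2 * x - m1 * y) ** 2 - (m2 ** 2 * x + m1 ** 2 * y)
            mean += wx * wy * w
            second += wx * wy * w * w
    return mean, second - mean ** 2


def w_moments_closed_form(m1, m2, p, q) -> Tuple[float, float]:
    """Mean and variance of W_i as stated in Lemma 2, with z_i = m2*p_i + m1*q_i."""
    z = m2 * p + m1 * q
    mean = m1 ** 2 * m2 ** 2 * (p - q) ** 2
    var = 2 * m1 ** 2 * m2 ** 2 * z ** 2 + 4 * m1 ** 3 * m2 ** 3 * z * (p - q) ** 2
    return mean, var


if __name__ == "__main__":
    print(w_moments_numeric(5, 3, 0.4, 0.2))
    print(w_moments_closed_form(5, 3, 0.4, 0.2))
```

The truncation is harmless here because the Poisson means used are small; for large $m_1 p_i$ the cutoff `kmax` would need to grow accordingly.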


A.3 Moments of $Z_A$

Recall that

$$Z_i := \frac{(m_2 X_i - m_1 Y_i)^2 - (m_2^2 X_i + m_1^2 Y_i)}{X_i + Y_i}$$

and for $A \subseteq [n]$, $Z_A := \sum_{i \in A} Z_i$. We show that if $p = q$, then $\mathbb{E}[\sum_{i \in A} Z_i] = 0$, and otherwise, we give a
lower bound on the expectation of the sum:

Lemma 3. If p = q, then E[£h gA Z,] = 0, and otherwise, E[£h eA z i\ > • 

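Proof. Conditioned on the denominator,

$$X_i \mid X_i + Y_i = \sigma \;\sim\; \mathrm{Bin}\left( \sigma, \frac{m_1 p_i}{m_1 p_i + m_2 q_i} \right).$$

Set $\beta_i = \frac{m_1 p_i}{m_1 p_i + m_2 q_i}$. Then using binomial moments we get,

$$\mathbb{E}[(m_2 X_i - m_1 Y_i)^2 \mid X_i + Y_i = \sigma] = \sigma \beta_i (1 - \beta_i)(m_1 + m_2)^2 + \sigma^2 (m_2 \beta_i - m_1 (1 - \beta_i))^2$$
$$= (m_1 + m_2)^2 \left( \sigma \beta_i (1 - \beta_i) + \sigma^2 \left( \frac{m_1}{m_1 + m_2} - \beta_i \right)^2 \right). \tag{14}$$

Similarly,

$$\mathbb{E}[m_2^2 X_i + m_1^2 Y_i \mid X_i + Y_i = \sigma] = m_1^2 \sigma + (m_2^2 - m_1^2)\, \mathbb{E}[X_i \mid X_i + Y_i = \sigma] = m_1^2 \sigma + (m_2^2 - m_1^2) \sigma \beta_i.$$

Therefore, the conditional expectation of the numerator is

$$\mathbb{E}\left[ (m_2 X_i - m_1 Y_i)^2 - (m_2^2 X_i + m_1^2 Y_i) \mid X_i + Y_i = \sigma \right] = (m_1 + m_2)^2 \sigma (\sigma - 1) \left( \frac{m_1}{m_1 + m_2} - \beta_i \right)^2$$
$$= \sigma (\sigma - 1) \left( \frac{m_1 m_2 (q_i - p_i)}{m_1 p_i + m_2 q_i} \right)^2. \tag{15}$$

This implies

$$\mathbb{E}\left[ \sum_{i \in A} \frac{Z_i}{m_1^2 m_2^2} \right] = \sum_{i \in A} \frac{(q_i - p_i)^2}{z_i} \left( 1 - \frac{1 - e^{-z_i}}{z_i} \right),$$

where $z_i = m_1 p_i + m_2 q_i$. This implies that the expectation of the sum is zero if $p = q$. Let $g(z) =
z / \left( 1 - \frac{1 - e^{-z}}{z} \right)$. Now, using the fact that $g(z) \le 2 + z$ and the Cauchy-Schwarz inequality, the result
follows. □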
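
The closed form for the expectation used in the proof, $\mathbb{E}[Z_i] = m_1^2 m_2^2 \frac{(q_i - p_i)^2}{z_i} \left( 1 - \frac{1 - e^{-z_i}}{z_i} \right)$ with $z_i = m_1 p_i + m_2 q_i$, can be checked numerically. The sketch below does so by direct summation over a truncated Poisson support; the function names and `kmax` are hypothetical, and terms with $X_i + Y_i = 0$ are assumed to contribute zero.

```python
import math
from typing import List


def poisson_pmf(lam: float, kmax: int) -> List[float]:
    """Pr[Pois(lam) = k] for k = 0..kmax, via the recurrence p_k = p_{k-1} * lam / k."""
    pmf = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        pmf.append(pmf[-1] * lam / k)
    return pmf


def z_mean_numeric(m1, m2, p, q, kmax: int = 80) -> float:
    """E[Z_i] from a truncated double sum over independent Poisson counts."""
    px = poisson_pmf(m1 * p, kmax)
    py = poisson_pmf(m2 * q, kmax)
    total = 0.0
    for x, wx in enumerate(px):
        for y, wy in enumerate(py):
            if x + y == 0:
                continue  # assumption: Z_i = 0 when the element is unseen
            numerator = (m2 * x - m1 * y) ** 2 - (m2 ** 2 * x + m1 ** 2 * y)
            total += wx * wy * numerator / (x + y)
    return total


def z_mean_closed_form(m1, m2, p, q) -> float:
    """E[Z_i] as derived in the proof of Lemma 3, with z_i = m1*p_i + m2*q_i."""
    z = m1 * p + m2 * q
    return m1 ** 2 * m2 ** 2 * (q - p) ** 2 / z * (1 - (1 - math.exp(-z)) / z)


if __name__ == "__main__":
    print(z_mean_numeric(5, 3, 0.4, 0.2), z_mean_closed_form(5, 3, 0.4, 0.2))
```

When $p_i = q_i$ both quantities vanish, which is the first claim of the lemma.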
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Lemma 4. For i E [n] and p = q, 

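$\mathrm{Var}[Z_i] \le 2 m_1^2 m_2^2 \Pr[X_i + Y_i > 0]$, and hence $\mathrm{Var}[Z_A] = O(m_1^3 m_2^2)$.


For $p_i > q_i$, $\mathrm{Var}[Z_i] \le O(m_1^3 m_2^2 p_i)$, and for $p_i < q_i$,

$$\mathrm{Var}[Z_i] \le O(m_1^3 m_2^2) \min\left\{ \frac{q_i^2}{p_i},\; m_1 q_i^2 \right\}. \tag{16}$$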
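Proof. The variance of $Z_i$ can be computed by using the formula for conditional variance. Define

$$G_i(\sigma) := \mathrm{Var}[(m_2 X_i - m_1 Y_i)^2 - (m_2^2 X_i + m_1^2 Y_i) \mid X_i + Y_i = \sigma].$$

Let $\beta_i = \frac{m_1 p_i}{m_1 p_i + m_2 q_i}$. Using formulas for binomial moments, the conditional variance is

$$G_i(\sigma) = F_i(\sigma) + L_i(\sigma),$$

where

$$F_i(\sigma) = 2 \beta_i^2 (1 - \beta_i)^2 \sigma (\sigma - 1)(m_1 + m_2)^4, \qquad L_i(\sigma) = 4 \beta_i (1 - \beta_i) \sigma (\sigma - 1)^2 (m_1 + m_2)^4 \left( \frac{m_1}{m_1 + m_2} - \beta_i \right)^2.$$

For $p_i = q_i$, $\beta_i = \frac{m_1}{m_1 + m_2}$ and $L_i(\sigma) = 0$. Also, from the proof of Lemma 3 it can be seen that
$\mathrm{Var}[\mathbb{E}[Z_i \mid X_i + Y_i = \sigma]] = 0$ when $p_i = q_i$. Therefore, for $p_i = q_i$,

$$\mathrm{Var}[Z_i] = \mathbb{E}[G_i(\sigma)/\sigma^2] = \mathbb{E}[F_i(\sigma)/\sigma^2] \le 2 m_1^2 m_2^2 \Pr[X_i + Y_i > 0].$$

Let $z_i = m_1 p_i + m_2 q_i$. Then $\Pr[X_i + Y_i > 0] = 1 - e^{-z_i} \le z_i$, and $\mathrm{Var}[Z_A] = \sum_{i \in A} \mathrm{Var}[Z_i] = O(m_1^3 m_2^2)$.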

To prove the bound in the case pi f qi, note that Ffa) = 0, for a = 0,1 and Ffcj) < 2/3?(1 — 
Pi) 2 a 2 (mi + i n 2 ) 4 , for cr > 2. Therefore, 


E 


Fifr) 


(T 


< 2 (mi + m 2 fPi( 1 - Pi) Pr[cr > 2] 


< 2(mim 2 ) 2 (mi + m 2 ) 4 


< 0(mfm 


4 - -^e 2i ) 


(17) 


Now, for pi > qi, za > mi + mz (p { + qi), and 
E 




cr 


< 0(m\m\) 




(pi + qi) 3 


< 0(m\m\pi). 


The remaining terms in the variance can be bounded similarly, and for $p_i > q_i$, it follows that $\mathrm{Var}[Z_i] \le
O(m_1^3 m_2^2 p_i)$.

For the case $p_i < q_i$, use the bound $z_i \ge m_1 p_i$ in (17) to get


E 


Fi(a) 


17‘ 


< Ofmfm.n) min < —, miq. 

Pi 


Qi 


(18) 


Similarly, L^a) = 0 for a = 0, 1 and L*(cr) < 4A(1 - A)c 3 ( mi + m 2 ) 4 (^ m ™+ m2 - A) ■ Therefore, 
for the case pi < y ( , using the bound z'f > m\m2pfqi, for Zi < 1 , and zf > mim 2 p*5j, for z, > I we get 


E 


Li(cr) 


(T- 


< 4(mi+ m 2 ) 4 A(l - A) 


m i 


mi + m 2 

2'Piqi{Pi - qi) 2 Zi( 1 - e _Zi ) 


-A E[cjl{a>2}] 


= 4m 4 m2(mi + m 2 ) 


.. . q. 

= 0 {m\m 2 ) min ^ —, mig 2 - 

. Pi 


(19) 


Finally, from Lemma [3] when pi < y, 

Var[E[Z *|Xi +1} = cr]] = (mi + m 2 ) 2 Var[cr] 

4 4 (?i - Pi) 4 

= 7711 7719-o- 

" Z i 

< 0(771^7112) min < —,77115? 


Till 


771i + 77l 2 


A 




Combining (18), (19), and (20), the variance bound (16) follows.


( 20 ) 

:□ 


For the analysis of the algorithms we also need bounds on the s-th moment of Z A corresponding to a set 
A with the property that for all i € A,pi < 2b' and q, < 2b', where b' = 2 0 ^° g n , as define in Algorithm^ 

Lemma 5. For any s£N, and set A C [n] such that for all i £ A, p t < 21/ and qi < 2b', 

E[\Z A -E[Z A \\ s ]<d s (mj s m 2 ), 

where O s suppresses factor of log°^ n. 

Proof. Trivially, |Z,;| < 3 m)Xi + 3 m} Y t . Since E[X?] is a degree s polynomial in rn\p,, E[Xf] = 
Osfa&xlm.fpj , Tiiipi}). Similarly, for E[T/] = OAmaxjTii.lyf, 777,25*}). Therefore, for i 6 i, 


E[\Zi\ s ] = O s (ml s E[Xf] 

+ mj s E[Y i s ]) = O s (m\ s m 2 max{p*, 5*}). 

(21) 

Similarly, E[ Z* ] s = O s (m\ s m 2 max { 71 *. 5*}), and 


E[\Z A -E[Z A ]\ a ]<O s 

( E[\Zi\ s ] + E[|Z*|] S ) < O s (m\ s m 2 ). 

(22) 


\i£A J 


Combining (|2T|) and (261 yields the lemma. 


0 


For the analysis of the algorithm in the extreme case, we will need bounds on the s-th moment of $Z_A$
corresponding to a set $A$ with the property that for all $i \in A$, $\frac{\epsilon^{2/3}}{20 m_2 n^{1/3}} \le p_i \le 2b'$ and $q_i < 2b'$. In this case, a
more careful analysis gives a better bound on the moments of $Z_A$.

Lemma 6. For any $s \in \mathbb{N}$, and set $A \subseteq [n]$ such that for all $i \in A$, $\frac{\epsilon^{2/3}}{20 m_2 n^{1/3}} \le p_i < 2b'$ and $q_i < 2b'$,

$$\mathbb{E}[|Z_A - \mathbb{E}[Z_A]|^s] \le \tilde{O}_s\left( \frac{n^{s/3} m_1^{s} m_2^{s+1}}{\epsilon^{2s/3}} \right),$$

where $\tilde{O}_s$ suppresses polylogarithmic factors in $n$.

Proof. From the definition Z r , 

(rr%Xl + m\Y? 

Zil - 0 V Xi + Y, 

Conditioned on Xj + Y t = a, Xj ~ Bin(a, m\Pi/zf) and Yj ~ Bin(u, m^qi/zj), where Zj = m\Pi + m 2 qj. 
Then, E [Xj] = am 2 qj/ Zj := Xj, and for any s > 1, 

E[X?|2Q + Yi = a] = 0(max{xj,x|}). 



Similarly, 

E[Y/| X* + Yi = cr]= 0(max{j/i, y?}) where E[Yj] = am 2 qi/zi := y*. 
Therefore, for cr > 0, 


E[|Z i | a |X i + Y i = ff] < O s 
< O s 


^max j 

f m\ s rrilf q^ s <J S 

m\ s m 2 qi \ 

i : 

(J s ~ l zl +l J 

^max j 

f m'f s rri2 S qf s a s 

m\ s m 2 qi \ 

[ ’ 

4 +l J 


Note that E[cr] = z t and E[cr s ] = O s (zf) because Zj> 1 by assumption. Using q. t < 2b' we get 

/ m\ s ml s q* s \ < /^ _ - / mfm 2 g» 

A z t ) ~ * \ Pt ) ~ 8 \ Pi ) V Pt 


Moreover, because rri \ p t > 1, 


Os 


mfm 2 qi\ ^ ^ 

A 1 ~ S 


m 


’i 1 m, 2 Q'A . „ (m s 1 m 2 q: 


Combining (24 1 and (251 with (231 and using pi > 


_S+1 


f 2/3 




Pi 


20m2n 1 t 3 


(since j e 4) gives 


I z,\‘\ <dj < 6 , f 


V pi 


-2s/3 


Similarly, it can be shown that E[|Zj|] s = O s ( -——~ )> and 


E[|Z A — E[Z a ]| s ] < I ^ E[|Zj| s ] + E[|Zj|] s J <0 

VieA / 


n s / 3 mfm2 +1 

f 2s/3 


(23) 


(24) 


(25) 


( 26 ) 


completing the proof of the lemma. 


□

B Proof of Proposition 1


We begin by establishing that, with high probability over the first set of samples, $S_1, T_1$, the sets $B, M, H$
successfully partition the elements in the “heavy”, “medium”, and “light” sets. This proof follows from a
union bound over Poisson tail bounds. The proof of Proposition 1 will then proceed by arguing that, with
high probability over the randomness of the second set of samples, $S_2, T_2$, the algorithm will be successful,
provided that the sets $B, M, H$ were a reasonable partition.

Definition 3. Let $b, b'$ be as defined in Algorithm 2. The set $B$ is said to be faithful if for all $i \in B$, $p_i \ge b/2$
or $q_i \ge b/2$. Similarly, $M$ is said to be faithful if for all $i \in M$, $b'/2 \le \max\{p_i, q_i\} \le 2b$. Finally, $H$ is
said to be faithful if $p_i < 2b'$ and $q_i < 2b'$, for all $i \in H$.

Lemma 7. With probability at least $1 - o(1/n)$ over the randomness in the samples $S_1, T_1$, the sets $B$, $M$, and
$H$ will be “faithful”.


Proof. We leverage the following Chernoff-style bound for Poisson distributions: for any $\lambda \le c$, and $\delta \in (0, 1)$,

$$\Pr[|\mathrm{Pois}(\lambda) - \lambda| > \delta c] \le 2e^{-\delta^2 c / 3}.$$
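
The tail bound above can be illustrated numerically. The sketch below sums the Poisson probability mass function directly and compares the two-sided tail with $2e^{-\delta^2 c/3}$ for a few arbitrary parameter choices; the function names and the truncation point `kmax` are illustrative assumptions, and the check is a spot test rather than a proof.

```python
import math


def poisson_two_sided_tail(lam: float, t: float, kmax: int = 400) -> float:
    """Pr[|Pois(lam) - lam| > t], summing the pmf over k = 0..kmax."""
    prob = math.exp(-lam)  # Pr[Pois(lam) = 0]
    tail = 0.0
    for k in range(kmax + 1):
        if abs(k - lam) > t:
            tail += prob
        prob *= lam / (k + 1)  # advance to Pr[Pois(lam) = k + 1]
    return tail


def chernoff_style_bound(c: float, delta: float) -> float:
    """Right-hand side 2 * exp(-delta^2 * c / 3) of the stated bound."""
    return 2.0 * math.exp(-delta ** 2 * c / 3.0)


if __name__ == "__main__":
    for lam, c, delta in [(5.0, 5.0, 0.9), (10.0, 20.0, 0.5), (30.0, 30.0, 0.5)]:
        print(lam, c, delta, poisson_two_sided_tail(lam, delta * c),
              chernoff_style_bound(c, delta))
```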

Let $X_i^{(1)}$ denote the number of occurrences of $i$ in the $\mathrm{Pois}(m_1)$ samples, $S_1$, drawn from $p$, and $Y_i^{(1)}$ denote
the number of occurrences of $i$ in the $\mathrm{Pois}(m_2)$ samples from $q$ that comprise $T_1$. For any domain element
$i$ with probability $p_i \ge b'/2$,


Pr 



1 

mipi\ > -mipi 


< 2e~^ miPi < 2e _201ogn 


o(l/n 2 ). 


Similarly, for any domain element i with probability q t > b'/2, 


Pr 



1 

m 2 qi\ > -m 2 qi 


< 2e~^3 m2qi < 2e _201ogn 


o(l/n 2 ). 


So far, this ensures that common elements do not occur too infrequently. To ensure that none of the rare
elements occur too frequently, note that the same bound implies that for any domain element $i$ with probability
$p_i < b'/2$,


Pr 


Xp > b'mi 


< Pr 



mipi| > b'mi/2 


< 2e~ b ' mi ^ < 2e -201ogn 


o(l/n 2 ). 


Analogously for any domain element i with probability qi < b’/ 2, 


Pr 


Yp 1 > b'm 2 


X Jr'l’ 






Note that if, for all domain elements $i$ with $p_i \ge b'/2$, $|X_i^{(1)} - m_1 p_i| \le \frac{1}{2} m_1 p_i$, and for all elements $i$
with $p_i < b'/2$, $X_i^{(1)} < b' m_1$, and the analogous statements hold for $q_i$ and $Y_i^{(1)}$, then the sets $B$, $M$, and $H$
will all be “faithful”. By our above bounds, and a union bound over the $n$ elements, with probability at least
$1 - o(1/n)$ this occurs. □

We now prove the correctness of Algorithm 2 by establishing that in the case that $p = q$, the algorithm
will output ACCEPT with probability at least 2/3, and in the case that $\|p - q\|_1 \ge \epsilon$ the algorithm will
output REJECT with probability at least 2/3. The analysis of these two cases is split into Lemmas 8 and 9.
Together with Lemma 7 this establishes Proposition 1.

B.1 $\|p - q\|_1 = 0$

We analyze the statistics of the algorithm in the case that $p = q$, with respect to the randomness in the
samples $S_2, T_2$ under the assumption that the sets $B, M, H$ are faithful.

Lemma 8. Given that the sets $B$, $M$, and $H$ are “faithful” and that $p = q$, then with high probability over
the randomness in $S_2, T_2$, Algorithm 2 will output ACCEPT.


Proof B.1.1 The statistic V B : 

By Lemma [l] 

Wb\ < 


2\B\ \ 1/2 / 2\B\ ^ 1/2 

+ V Pi - ® = - 

i£B v 2 


m 2 


128 log n 


From our definition of “faithful”, every element of i e B must have pi + qi > b/2 = - g2 m 

I Dl 2 e 2 mo ^ e 2 mg nnr i 
\ JD \ — 128logn ^ 641ogn’ dIlu 


hence 


E[V b ] < 


2\B\ \ 1/2 v/2 

- ) < e — . < £ /8, for n > 2. 

m 2 J ~ 8^ogn 


From Lemmaflj Var [V B ] < ^ ^ ^ = o(e 2 ). Flence by Chebyshev’s inequality, Pr[Ve > e/6] < 




o(l), and hence the first check of Algorithm [2] will pass. 

B.1.2 The statistic Wm- 

FromLemma[2j E \Wm] = rn\m\ f2 ieM (Pi ~ hi) 2 = 0- Additionally, 

Var [Wm\ = 2m\m\ y/ {m 2 Pi + miqi) 2 < 2m\m\ • ma x{m 2 Pi + mi®} ')Tfm 2 p l + m^j). 
ieM i 


From the fact that M is faithful, max,; { m 2 pi +m i q r } < O () • and hence we conclude that Var[IV m] = 

Q^ mfm 2 log n ^ 

By Chebyshev’s inequality, and the assumption that e > l/n 1//12 , 

e 2 m\m 2 log n 


Pr 


Wm > 


= o(l), 


and hence the second check of Algorithm [2] will pass. 


B.1.3 The statistic $Z_H$:

By Lemma 3, $\mathbb{E}[Z_H] = 0$, and by Lemma 4, $\mathrm{Var}[Z_H] = O(m_1^3 m_2^2)$. Therefore, by Chebyshev's inequality,
$\Pr[Z_H > C_\gamma m_1^{3/2} m_2] \le O(1/C_\gamma^2)$, which can be made arbitrarily small for a sufficiently large constant $C_\gamma$,
and hence the third check of Algorithm 2 will pass. □

B.2 $\|p - q\|_1 \ge \epsilon$

We now consider the execution of the algorithm when $\|p - q\|_1 \ge \epsilon$.

Lemma 9. Given that the sets $B$, $M$, and $H$ are “faithful” and that $\|p - q\|_1 \ge \epsilon$, then with high probability
over the randomness in $S_2, T_2$, Algorithm 2 will output REJECT.

Proof. The proof proceeds by considering the following three cases, at least one of which holds:
1) $\sum_{i \in B} |p_i - q_i| \ge \epsilon/3$, 2) $\sum_{i \in M} |p_i - q_i| \ge \epsilon/3$, and 3) $\sum_{i \in H} |p_i - q_i| \ge \epsilon/3$.

B.2.1 $\sum_{i \in B} |p_i - q_i| \ge \epsilon/3$

By Lemma 1, $\mathbb{E}[V_B] \ge \sum_{i \in B} |p_i - q_i| \ge \epsilon/3$ and $\mathrm{Var}[V_B] \le \frac{1}{m_1} + \frac{1}{m_2} \le \frac{2}{m_2}$. Therefore by Chebyshev's
inequality, $\Pr[V_B < \epsilon/6] = o(1)$, and hence the algorithm will output REJECT with high probability.

B - 2 - 2 EieM \v% - Qi\ > e/3 

From Lemma[2j E[JTm] = m 2 m 2 E*e m (Pi “ hi) 2 ■ From the definition of “faithful”, it follows that \M\ < 

2 1 28 > iog n ’ anc * hence by Cauchy-Schwarz, 

(m\ml) ^(Pi - hi) 2 > (rnjrn%)^T 
i&M 

Furthermore, from Lemma|2j 

Var [W M \ < 2m\m\ ^ z 2 + 4m\m\ ^ z t (p t -qf) 2 , 

i&M i&M 

where z. L = rn 1 q, + triypi ■ As in the proof of LcniniajsJ the first term is O ( 1 "y/" g n j. p or the second term, 
noting that E* < m\ + m 2 , and {pi — g *) 2 < Q( ^° g m a ), we get the bound of 0 ( m i r " £ 2 4 log ” ). 

By Chebyshev’s inequality and the assumption that e > l/n 1//12 , with probability 1 — o(l), Wm > 
e 2 mfrn .2 log n, and the algorithm will output REJECT 

B - 2 - 3 Xaeff bi “ Qi\ ^ e/3 

From Lemma I 3 J E[Z//] > (}( n, l "P- ). Using the assumption in the statement of Proposition [l] that m 2 = 
we conclude that 

E[Z#] = J7(m^ 2 m2). 

Using the moment bounds from Lemma[5]and the definition of “faithful”, for any integer s > 0, E[| Zh — 
E[Zf/]| s ] < O s {m\ s m, 2 )- By Markov’s inequality, 

P t[Zh < C^m^mf) < 

< 


Pr 

Pr 

O, 


\Z h -E[Z h }\ >U(mf 2 m 2 ) 

|Z H -E[Zff]| a >n(m? i/ 2 m|) 


m\ s m2 


m 


3s/2 


= 0 , 


mf 


TTtn 


m. 


s—1 


b» - <&!)' 


|M| 


> (m 2 m| 


128 £ 2 logra 
187712 


> 7 £ 2 m\m 2 log n. 
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As long as ^ < 1 jn c for some positive constant c, there will be some integer s c , dependent on 

m 2 

c for which this probability is o(l). Note that the stipulation in the proposition statement, that mi = 
O (in/ e 2 ) 1-7 ) , for some constant 7 > 0, ensures that ^ = 0(l/n~ 27 ), and hence the algorithm will 
output REJECT with probability 1 — o(l) in this case. □ 


C Proof of Proposition 2

In this section we prove Proposition 2, showing that Algorithm 3 performs as claimed in the extreme case
where $m_1 \approx n$. The algorithm is a slight modification of Algorithm 2, tailored to handle the imbalance
between the sample sizes from $p$ and $q$. We prove that this algorithm works whenever $m_1 = \Omega((n/\epsilon^2)^{8/9+\gamma})$
for some $\gamma > 0$, and overlaps with the regime of parameters for which the non-extreme algorithm,
Algorithm 2, will succeed.

We begin the proof of the above proposition by considering the statistic $R_H$.
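Observation 1. Define $R_A = \sum_{i \in A} \frac{\mathbf{1}\{Y_i = 2\}}{X_i + 1}$, for $A \subseteq [n]$. Then

$$\mathbb{E}[R_A] = \sum_{i \in A} \frac{(m_2 q_i)^2 \left( 1 - e^{-m_1 p_i} \right) e^{-m_2 q_i}}{2 m_1 p_i}. \tag{27}$$

Proof. Since $X_i \sim \mathrm{Pois}(m_1 p_i)$, $\mathbb{E}\left[ \frac{1}{X_i + 1} \right] = \frac{1 - e^{-m_1 p_i}}{m_1 p_i}$. Also, $Y_i \sim \mathrm{Pois}(m_2 q_i)$ implies
$\Pr[Y_i = 2] = \frac{(m_2 q_i)^2 e^{-m_2 q_i}}{2}$. The expectation of $R_A$ now follows from linearity of expectation and the
independence of $X_i$ and $Y_i$. □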
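
The per-element expectation in (27) rests on the identity $\mathbb{E}[1/(X+1)] = (1 - e^{-\lambda})/\lambda$ for $X \sim \mathrm{Pois}(\lambda)$. The following sketch checks both the identity and the resulting formula for one element by direct summation; the function names and the truncation point `kmax` are hypothetical, and the parameter values are arbitrary.

```python
import math


def r_mean_numeric(m1, m2, p, q, kmax: int = 100) -> float:
    """E[ 1{Y_i = 2} / (X_i + 1) ] by summing over a truncated Poisson support."""
    lam_x, lam_y = m1 * p, m2 * q
    px = math.exp(-lam_x)  # Pr[X_i = 0]
    expected_inverse = 0.0
    for x in range(kmax + 1):
        expected_inverse += px / (x + 1)
        px *= lam_x / (x + 1)  # advance to Pr[X_i = x + 1]
    prob_y_is_2 = math.exp(-lam_y) * lam_y ** 2 / 2.0
    # Independence of X_i and Y_i lets the two factors multiply.
    return expected_inverse * prob_y_is_2


def r_mean_closed_form(m1, m2, p, q) -> float:
    """Single-element summand of equation (27)."""
    return (m2 * q) ** 2 * (1 - math.exp(-m1 * p)) * math.exp(-m2 * q) / (2 * m1 * p)


if __name__ == "__main__":
    print(r_mean_numeric(50, 8, 0.05, 0.1), r_mean_closed_form(50, 8, 0.05, 0.1))
```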


As mentioned before, in the extreme case the statistic $Z_A$ can incur a variance of $O(n^4)$, which is at the
threshold of what can be tolerated. The statistic $R_A$ is tailored to deal with these cases. This is formalized
in the following lemmas: whenever the variance of $Z_A$ is at least the tolerance threshold $O(m_1^3 m_2^2)$, the
expected value of $R_A$ in the case $p = q$ is well separated from the likely values of $R_A$ in the case $\|p - q\|_1 \ge \epsilon$.

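Lemma 10. If $p = q$, $\mathbb{E}[R_A] \le \frac{m_2^2}{2 m_1}$. If $p \ne q$ and $\max_{i \in A} q_i \le \frac{10}{m_2}$ and $\mathrm{Var}[Z_A] = \Omega(m_1^3 m_2^2)$, then
$\mathbb{E}[R_A] \ge \Omega(m_2^2 / m_1)$.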

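Proof. If $p = q$, then

$$\mathbb{E}[R_A] = \frac{m_2^2}{2 m_1} \sum_{i \in A} \frac{q_i^2 \left( 1 - e^{-m_1 p_i} \right) e^{-m_2 q_i}}{p_i} \le \frac{m_2^2}{2 m_1} \sum_{i \in A} \frac{q_i^2}{p_i} \le \frac{m_2^2}{2 m_1}.$$

Now, suppose $p \ne q$. Let

$$A_0 := \{ i \in A : m_1 p_i \ge 1/2 \}.$$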


Note that VarfAt] > fl(mfm 2 ) implies that either Y^ieA 0 


constant C (since by Lemma 
separately: 


4 


Var [Z A ] < E iS A 


2 

% > C or mi Eiezt\A 0 > C for some 
min < —, rri 1 qf >). We consider the two cases 
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1 Suppose p" — Since Qi — 10/ 777,2 for all i £ A, it holds that for i £ Ao,e m2Qi > e 

Moreover, i £ Aq implies 1 — e~ miPi > 1 — e -1 / 2 . Therefore, 


-10 


E 

ieA 0 


m 2 q 2 (1 — e miPi ) e m29i e 12 ? 77 .| Qi ^ C ■ e 12 m| 


2mipj 


V^> 

mi itr 0 pi 


mi 


2 Suppose mi X^eA\A 0 Qi > C. Using the inequality 1 — e x > x — x 2 /2, 


E 

i£A\A 0 


m 2 2 qf (1 - e~ mip ‘) e~ m2qi 
2m\pi 


> 


e 10 m| 
2 mi 


Qi rum 


M 


Pi 


> 


e 10 m ,2 
2mi 

e~ 10 m,o 


E 

ieA\Ao 
Y (mig 2 - mjqfpi/2) 


ieA\A 0 


Y (Vi-tf/ 4 ) 


ieA\A 0 


e 10 mn 


x 2 C ■ 3e 10 mn 

Y 3 ^ / 4 ^- 


8 


ieA\A 0 

where the second to last inequality uses that assumption that m\Pi < 1/2 for i £ A \ Aq. 
Combining the above cases it follows that E// 4 ] > U(m|/mi). 


□ 


From the proof of the above lemma it is clear that we can choose some absolute constant K such that 
whenever if p / q and 

max \qj\ < 10 /m 2 , Var [Za\ > iCmfm|, (28) 

i£A 


then E[1?a] > llmj^mi. Hereafter, fix this constant K. 


C.1 $p = q$

Suppose $m_1 = \Omega((n/\epsilon^2)^{8/9+\gamma})$ for some $\gamma > 0$. We analyze the statistics in Algorithm 3 in the case that
$p = q$, with respect to the randomness in the samples $S_2, T_2$ under the assumption that the sets $B, M, H$ are
faithful.


Lemma 11. Given that the sets $B$, $M$, and $H$ are “faithful” and that $p = q$, then with high probability over
the randomness in $S_2, T_2$, Algorithm 3 will output ACCEPT.

Proof. From calculations identical to those in case |B.1.1[ |B.1.2 it follows that 

Pr [Vb > e/6] < Pr Wm > £ < _L ; Pr [Z H > C , 2 m? /2 m 2 ] < 


when $p = q$. Therefore, the unknown distributions will pass the checks in Algorithm 3 that correspond to
the statistics $V_B$, $W_M$, and $Z_H$.

It remains to verify the additional two checks in Algorithm 3.
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C.1.1 Check (1) in Algorithm [3] 

To show that the first check in Algorithm [3]passes, we will show that when p = q, 


Pr 


there exists i £ \n] such that Y, > 3 and X, < 


miE 


2/3 


lOm^n 1 / 3 


< 1/50. 


/3 f \ 

x/3 = 17 I 1 4/3 — 1 = fl(n 7 ) for some constant 7 > 0, since by assumption, m\ = 
fl((n/e 2 ) 8 / 9+7 ) for some 7 > 0. 


Denote A = ™ l£ ' - 

Wrri2n 


If pi > 77. Then Pr [Xi < A] < Pr[Pois(2A) < A] = o(l/n 2 ). On the other hand, if pj = (^ < 77, 


then 


mi 


Pr[Y t > 3] < Pr 


Pois > 3 

V m i / 


= Pr 


Pois (^i) 


> 3 


< 


lOOn 


Hence by a union bound over all i £ [n], check (1) in Algorithm [3] passes. 


C.1.2 The statistic R 


Recall that $H = [n] \setminus (B \cup M)$, where $B$ and $M$ are defined in (2). Note that by Lemma 10, when $p = q$,

$$\mathbb{E}[R_H] \le \frac{m_2^2}{2 m_1}.$$

Recall that $m_2^2 / m_1 \ge 1$, and the second criterion for Algorithm 3 rejecting is $R_H \ge C m_2^2 / m_1$, for a large
constant $C$. Since $R_H$ is a sum of independent random variables, each of which is in the range $[0, 1]$, a
standard Chernoff bound applies, yielding that the probability the algorithm rejects due to this $R_H$ is at
most 1/100. □


C.2 $\|p - q\|_1 \ge \epsilon$

Lemma 12. Given that the sets $B$, $M$, and $H$ are “faithful” and that $\|p - q\|_1 \ge \epsilon$, then with high
probability over the randomness in $S_2, T_2$, Algorithm 3 will output REJECT.

Proof. The proof proceeds by considering the following three cases, at least one of which holds:
1) $\sum_{i \in B} |p_i - q_i| \ge \epsilon/3$, 2) $\sum_{i \in M} |p_i - q_i| \ge \epsilon/3$, and 3) $\sum_{i \in H} |p_i - q_i| \ge \epsilon/3$. Now, if either
$\sum_{i \in B} |p_i - q_i| \ge \epsilon/3$ or $\sum_{i \in M} |p_i - q_i| \ge \epsilon/3$, then from calculations identical to those in Sections B.2.1
and B.2.2 it follows that the algorithm will output REJECT.

Therefore, assume that $\sum_{i \in H} |p_i - q_i| \ge \epsilon/3$. We begin the proof with the following observation:

Observation 2. Suppose there exists $j \in [n]$ such that $q_j \ge \frac{10}{m_2}$ and $p_j < \frac{\epsilon^{2/3}}{20 m_2 n^{1/3}}$; then

Pr 


3i £ [ n\s.t.Yi > 3 and Xi < 


mie 2 / 3 

lOm^n 1 / 3 



(29) 


that is, Algorithm [7]/h/V.v the first check and REJECTS. 

Proof. Given j with q :j > ^ and Pj < 20 ^ 1/3 , Pr[Y) > 3] > 0.99, and Pr 
l-o(l). □ 


Y ^ iriie 2 / 3 

J lOtr^n 1 / 3 


> 
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Given this observation, we may continue under the assumption that for all i G [n] such that q % > 3T, 

2/3 

K > -,» £ ~ t 7 s ■ Now, define 

So := {z G [n] : qi < 10/m 2 }, 

and consider the following cases: 


Case 1 YlieSo I Pi ~ Qi\ — e /6- To begin with suppose that Var[Zs 0 ] < Kmfm^, with K as defined in 

Zs Q ] > CL{m'l /2 m2) 


(28). Then by Chebyshev’s inequality Pr [Zjj < C^rri'l' 2 m^] < 2 ’ 0 (since E 
by Lemma 3). Otherwise, Var[Zg 0 ] > Kmfrri^, in which case, by Lemma 


10 


m So \ > 


11 m; 


2mi 


since i?# > Ks 0 is a sum of independent random variables, with values betweenO and 1, a Chernoff 
bound yields that with probability at least 0.99, Rh will exceed the threshold and the second check 
of Algorithm [3] will fail. 

Case 2 Finally, suppose that J2i£H\s 0 I Pi ~ Qi\ Z e/6. Since q % > IO/ttt .2 for all i £ H \ Sq, it suf- 
fices to assume that p-; > 2 om n i/ 3 - T rom Lemma 6 letting T = H \ Sq. wc have that E\Zj] > 
0(e 2 mf?n|/36n), and 


E[\Z t -E[Z t ]\ s ] = o 


n 


s/3 


m. 


s+l 


-2s/3 


By Markov’s inequality, 

Pr[Z^ < C 7 m^ 2 m. 2 / 2 ] < Pr[|Z-r — E[Zr]| > f2(m^ 2 m 2 )] 


< O.s 


< o s 


n S// 3 mfmn +1 


£ 2s/3 m 3 s / 2 


ms 


n s / 3 rri2 

K £ 2s / 3 m^ 2 / 


(30) 


If m 2 = 


then (30) becomes ( ^ n ^f /2 Vi /2 ) ■ Since m 2 > 0((n/e 2 ) 8 / 9 ), by taking s > 5, we 


can make the probability in (30) o(l). Similarly, if mi = n and m 2 = yTi/e 2 , then with s = 6, (30) 
becomes O s j = o(l) as e > n _ i 2 . Together with the concentration of Zs 0 from Chebyshev’s 
inequality, we get that in this case, the Z statistic check will fail and the algorithm will output REJECT 
with probability at least 0.99 in this case. 


□ 


D Lower Bound for $\ell_1$ Testing


In this section, we present lower bounds for the closeness testing problem under the t\ norm using the 
machinery developed in Valiant lf30l [3Tl . To this end, define the (k\, fc 2 )-based moments m(r,s) of a 
distribution pair (p. q) as k\k^ X/=i ////,'■ Valiant lOTl Theorem 4.6.9] showed that if the distributions 
pf , p 2 have probabilities at most 1 /lOOOAq, and pp , pT, have probabilities at most 1/1000/^2, and 

|m+(r, s) - m~(r,s)\ < 1 ^ 

r 2 ~s> l V 1 + ma x{m+(r,s),m~(r,s)} 1000 
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then the distribution pair (Pi,p 2 ) cannot be distinguished with probability 13/24 from (jp . pf) by a tester 
that takes Pois(fci) samples from (pf ,p 2 ) and PoisfA’a) samples from (jp .pi, ). 

Using this we prove the following proposition: 

Proposition 3. Let $n^{2/3}/\epsilon^{4/3} \le m_1 \le n$. Then there exist distributions $p$ and $q$ such that, given $\Theta(m_1)$
samples from $p$, $\Omega\left( \frac{n}{\sqrt{m_1}\, \epsilon^2} \right)$ samples from $q$ are required to distinguish between $p = q$ and $\|p - q\|_1 \ge \epsilon$ with
high probability.

Proof. Fix $\delta = 1/4$. Let $b = 1/m_1$ and $a = C/n$, where $C$ is an appropriately chosen constant. Let $A$, $B$,
and $C$ be disjoint subsets of size $(1 - \delta)/b$, $1/a$, $1/a$, respectively. Consider two distributions

$$p = b \mathbf{1}_A + \delta a \mathbf{1}_B,$$

and

$$q = b \mathbf{1}_A + \delta a (1 + z \epsilon) \mathbf{1}_B,$$

where $z$ is 1 or $-1$ depending on whether the index is even or odd (this is done so that $\sum_{i=1}^{n} q_i = 1$). Then
clearly $\|p - q\|_1 = \delta \epsilon = \epsilon/4$.

Define $k_1 = c m_1$ and $k_2 = c \epsilon^{-2} n / \sqrt{m_1}$, where $c$ is a sufficiently small constant. Then $\|p\|_\infty = b \le
\frac{1}{1000 k_1}$ and $\|q\|_\infty = b \le \frac{1}{1000 k_2}$, whenever $m_1 \ge n^{2/3}/\epsilon^{4/3}$ and $b \ge a$.
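
The construction can be instantiated exactly for small parameters. The sketch below builds $p$ and $q$ with rational arithmetic so that the normalization and the $\ell_1$ distance $\delta \epsilon$ can be verified without rounding error. The function name `hard_instance` is hypothetical, and the size of $B$ is passed directly as `size_b` (standing in for $1/a$) rather than through the constant $C$; both are assumptions made for illustration.

```python
from fractions import Fraction
from typing import List, Tuple


def hard_instance(
    m1: int, size_b: int, eps: Fraction, delta: Fraction = Fraction(1, 4)
) -> Tuple[List[Fraction], List[Fraction]]:
    """Build (p, q): mass b = 1/m1 on A, mass delta*a on B, q perturbed by +/- eps on B."""
    b = Fraction(1, m1)
    a = Fraction(1, size_b)
    size_a = (1 - delta) / b
    # The sketch needs |A| to be an integer and |B| even so the +/- eps terms cancel.
    if size_a.denominator != 1 or size_b % 2 != 0:
        raise ValueError("need (1 - delta) * m1 integral and size_b even")
    heavy = [b] * int(size_a)
    p = heavy + [delta * a] * size_b
    q = heavy + [
        delta * a * (1 + (1 if i % 2 == 0 else -1) * eps) for i in range(size_b)
    ]
    return p, q


if __name__ == "__main__":
    p, q = hard_instance(m1=8, size_b=16, eps=Fraction(1, 2))
    print(sum(p), sum(q), sum(abs(pi - qi) for pi, qi in zip(p, q)))
```

The two distributions agree on the heavy elements of $A$, so every bit of the distance comes from the light elements of $B$.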

Let $(p, p) = (p_1^+, p_2^+)$ and $(p, q) = (p_1^-, p_2^-)$. Computing the $(k_1, k_2)$-based moments gives:

$$m^+(r, s) = k_1^r k_2^s (1 - \delta) b^{r+s-1} + k_1^r k_2^s \delta^{r+s} a^{r+s-1},$$

and

$$m^-(r, s) = k_1^r k_2^s (1 - \delta) b^{r+s-1} + k_1^r k_2^s \delta^{r+s} a^{r+s-1} \left( \frac{(1 + \epsilon)^s + (1 - \epsilon)^s}{2} \right).$$

By Theorem 4.6.9 of Valiant [31], to show that $(k_1, k_2)$ samples are not enough, it suffices to have (31).
Observe, 

|m+(r, s) — m~ (r, s)| < fc[fc^ r + 8 a r+fl - 1 (l - ^((1 + e) s + (1 - e) 8 )) 

y/l + max{m+(r, s), m~(r, s)} ^k\k 2 ( 1 — 8)b r+s ~ 1 

For any s > 0, define h{e, s] = 1 — ^ 1+c - > ^ 1-c ) Observe that h(e, 1] = 0, and | h(e, s)| < 1, for s / 1. 
Note that mi > n 2//3 /e 4 / 3 , implies that e >n~±. Therefore, for every fixed r > 0 and s ^ 1, 


h{e,s)kjk j6-( r+a - 1 )/ 2 o r+8 - 1 < c^ 




r+s 

< c 2 , 


since mi < n by assumption. This shows ( |3T] ) if c is chosen small enough. □ 

The optimality of the $\ell_1$ tester, establishing the lower bound in Theorem 1, follows from the above
proposition together with the lower bound of $\sqrt{n}/\epsilon^2$ for testing uniformity given in Paninski [22].
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